NORMALIZATION OF KAZAKH LANGUAGE WORDS
Abstract
Subject of Research. The paper considers models and existing algorithms for normalizing natural language words. It describes algorithms for automatic extraction of word stems (bases) in a number of natural languages and possible ways of synthesizing the normal word form for the Kazakh language. The research aims to create a complete classification of the Kazakh ending system and to develop a word normalization algorithm based on the proposed classification of endings and suffixes.
Method. Word formation through the attachment of endings was analyzed for all Kazakh parts of speech, and a classification of endings and suffixes is presented. The paper discusses all possible placement options for endings and suffixes. The total number of distinct suffixes is 26 526, and the number of endings is 3 565. All the considered types are lexically and semantically valid, but some of them are not used in practice, so only the most commonly used ones are added to the affix base. The order in which affixes attach to the stem is described using sets, which ensures that the stem is identified correctly. Word-forming suffixes, which are mostly added to nouns, are not examined, because they change the word stem and its contextual interpretation.
Main Results. A complete classification system for the endings and suffixes of the Kazakh language has been developed. Deterministic finite automata were built for the various parts of speech; they cover all possible ways of attaching suffixes and endings and take into account the morphological and lexical features of Kazakh grammar. A lexicon-free stemming algorithm was developed on the basis of the proposed classification of endings. A normalization system was implemented, demonstrating that the algorithm works without a dictionary. The implementation was tested on a Kazakh language corpus from which punctuation and stop words had first been removed.
Practical Relevance. The results can be applied to text analysis and normalization (lemmatization), information retrieval systems, machine translation from Kazakh, and other applied problems.
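The idea of lexicon-free stemming by ordered ending removal can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ending lists below are a tiny illustrative subset (a few plural and case endings) rather than the classification described in the paper, the two-layer stripping order stands in for the deterministic finite automata, and the names `PLURAL`, `CASE`, `LAYERS`, `MIN_STEM`, `STOP_WORDS`, `stem`, and `normalize` are all hypothetical.

```python
import re

# Illustrative subset only (assumption): a few Kazakh plural and case endings.
PLURAL = ("лар", "лер", "дар", "дер", "тар", "тер")
CASE = (
    "ның", "нің", "дың", "дің", "тың", "тің",  # genitive
    "дан", "ден", "тан", "тен", "нан", "нен",  # ablative
    "ға", "ге", "қа", "ке",                    # dative
    "да", "де", "та", "те",                    # locative
)

# Endings attach as stem + plural + case, so they are stripped
# right to left: case layer first, then plural layer.
LAYERS = (CASE, PLURAL)

MIN_STEM = 2  # assumed guard against stripping a word down to nothing

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"және", "мен", "бұл"}


def strip_one(word, endings):
    """Remove the longest matching ending from one layer, if any."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)]
    return word


def stem(word):
    """Strip at most one ending per layer, without consulting a dictionary."""
    for layer in LAYERS:
        word = strip_one(word, layer)
    return word


def normalize(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

With these assumptions, `stem("кітаптарда")` first removes the locative ending `да` and then the plural ending `тар`, leaving `кітап`. Because no dictionary is consulted, a sketch like this can over-stem words whose stems happen to end in a listed ending, which is why the order and compatibility of affixes matter.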